Journal: bioRxiv
Article Title: Protein Language Models Outperform BLAST for Evolutionarily Distant Enzymes: A Systematic Benchmark of EC Number Prediction
doi: 10.64898/2026.03.31.715487
Figure Legend Snippet: Cross-organism validation: prokaryotes. Horizontal bar chart comparing PLM (ESM2-650M + MLP, blue) vs BLAST-90K (amber) EC4 accuracy for 9 held-out prokaryotic proteomes at a 50% sequence identity threshold. Organisms are sorted by PLM−BLAST advantage (Δ pp annotated on the right margin). Y-axis tick colors: dark grey = Bacteria, purple = Archaea. † E. coli K-12 is partially in-distribution (∼20% of its proteome is in the training set); interpret with caution.
Article Snippet: Note: E. coli K-12 (taxon 83333) is not fully held-out; approximately 20% of its proteome (∼1,303 proteins) is present in the 90K training set, and results for this organism should be interpreted with caution.
Techniques: Biomarker Discovery, Sequencing, Bacteria